This tutorial provides a brief introduction to mapping linguistic data using R. That means we'll be working with regional data and mapping features to help us understand the regional distribution of language varieties. This is useful for dialectology, and it is also used in dialectometry and NLP approaches.
In preparation for our maps, we'll need to load a couple of packages. For mapping we'll need 'maps' and for some optional maps 'rworldmap', which also loads 'sp'. Depending on your version of R you might also need 'broom'. We'll also use 'sf' to convert the geo-information into a suitable format. The main work for the maps will be done using ggplot, so we need the tidyverse package as well.
library(maps) # to get US maps
library(rworldmap) # mapping other country outlines
library(tidyverse) # making pretty maps
library(sf) # to change the geo-information to suitable format
The data we'll be using is based on a collection of 1 billion Tweets / 9 billion words. All Tweets are geocoded American Tweets collected between 2013 and 2014. From this, a US Twitter swearing dataset was compiled. See Huang et al. (2016) and Grieve et al. (2017) for more information.
The initial step now is reading in the dataset.
norm_swear <- read.table("BSLSS_SWEAR.txt", header = TRUE, sep = ",")
The dataset consists of 52 swear words measured across 3,085 locations, each identified by state plus county; together with the county column, that makes 53 columns.
dim(norm_swear)
## [1] 3085 53
The locations are coded as state-county pairs. These are the first 15 rows of our dataset.
head(norm_swear, 15)
## county ass asshole bastard bitch bitched bitchy bloody
## 1 alabama,autauga 1520421 49600 9538 962106 6995 8903 5087
## 2 alabama,baldwin 1246775 54318 6578 807348 2334 7851 14004
## 3 alabama,barbour 2263661 29188 3243 959948 3243 6486 3243
## 4 alabama,bibb 1451192 14629 2926 1009398 0 8777 0
## 5 alabama,blount 559433 72969 4230 506556 2115 5288 3173
## 6 alabama,bullock 2168413 56605 0 1184354 0 8708 0
## 7 alabama,butler 2638306 38680 11282 1806683 6447 4835 3223
## 8 alabama,calhoun 1604872 38763 8012 917534 2166 5197 4115
## 9 alabama,chambers 1881425 34756 5902 1438120 1312 1967 20329
## 10 alabama,cherokee 380377 37028 1683 272660 1683 6732 5049
## 11 alabama,chilton 1202164 58668 6400 1122162 5333 6400 6400
## 12 alabama,choctaw 1398501 26458 0 653894 0 7559 3780
## 13 alabama,clarke 1405435 50611 7786 809780 0 7786 0
## 14 alabama,clay 1322939 35436 7087 668557 0 14174 2362
## 15 alabama,cleburne 629833 49888 6236 433400 0 18708 15590
## bullshit cock crap crappy cunt damn damnit damned darn dick
## 1 120184 15897 146255 13354 22892 1206925 19077 8903 13990 210481
## 2 98452 7002 109910 10397 9124 907073 9760 10185 10185 113729
## 3 74591 3243 113507 19458 3243 1258310 0 25945 22701 136209
## 4 105328 0 90699 8777 2926 1176168 17555 2926 17555 93625
## 5 101523 9518 201988 6345 9518 469543 13748 23266 8460 59222
## 6 182878 8708 21771 4354 0 1240960 0 0 4354 300443
## 7 164390 9670 46738 4835 0 1513359 3223 11282 14505 267537
## 8 95500 8446 74711 6713 10395 1027543 14292 14292 12344 147689
## 9 155419 7869 55741 3935 8525 1080065 13116 8525 10492 172469
## 10 40394 1683 121182 15148 3366 336617 5049 11782 6732 38711
## 11 129070 5333 120536 13867 13867 802154 9600 22401 11734 120536
## 12 56696 0 22678 3780 0 578299 0 0 15119 49137
## 13 66184 7786 147941 3893 3893 1140699 0 3893 31145 89543
## 14 96858 4725 203166 14174 4725 1030002 11812 14174 21262 153555
## 15 112247 12472 159017 15590 9354 654777 24944 21826 21826 102893
## dickhead douche douchebag dumbass dyke fag faggot fatass freaking friggin
## 1 3179 14626 6359 43241 2544 42605 40697 6359 167876 2544
## 2 2971 18884 5729 29069 2546 19521 15489 4031 170593 2546
## 3 0 3243 3243 22701 0 9729 0 0 175126 3243
## 4 2926 5852 0 11703 2926 2926 8777 0 187251 8777
## 5 9518 13748 4230 25381 0 16920 32783 4230 195643 5288
## 6 0 4354 0 52251 17417 13063 52251 0 47897 0
## 7 0 4835 1612 29010 0 6447 12893 1612 78972 1612
## 8 650 7363 2815 18840 3898 11044 9312 3032 104378 2599
## 9 1967 656 0 26231 0 9181 9837 8525 127221 1312
## 10 1683 6732 1683 15148 0 1683 10099 0 107717 13465
## 11 1067 16000 3200 27734 3200 11734 13867 0 151471 2133
## 12 0 0 0 3780 0 0 0 0 64255 11339
## 13 0 3893 0 58398 3893 3893 7786 0 179086 7786
## 14 0 7087 14174 25986 2362 9450 14174 2362 179542 2362
## 15 0 12472 3118 12472 3118 18708 6236 6236 149663 9354
## fuck fucked fucker fuckery fucking goddamn gosh hell hoe homo
## 1 1441570 212388 21620 6359 592017 5087 69948 695667 268347 17169
## 2 1137714 139191 15065 2546 462767 5941 81690 573101 252920 7426
## 3 1115615 158910 12972 3243 376196 9729 64861 901573 369710 3243
## 4 1351715 236989 20481 17555 833850 8777 38035 506162 298431 2926
## 5 775168 88832 12690 3173 319374 4230 132191 379653 131134 1058
## 6 1941993 278672 21771 8708 574760 13063 30480 735867 335277 0
## 7 2427177 328781 17728 14505 676902 27398 48350 862244 515735 1612
## 8 1305163 188184 12127 8879 401056 7363 52189 747323 293645 6063
## 9 1800109 220997 17706 5246 445273 15739 58364 908252 398713 3935
## 10 464532 50493 33662 5049 323152 1683 42077 373645 69006 5049
## 11 1123229 161071 24534 7467 652817 2133 97069 489613 378676 1067
## 12 1190616 200326 7559 7559 302379 7559 18899 585859 238123 7559
## 13 774741 124581 7786 3893 288095 19466 81757 654053 140154 0
## 14 1256792 148831 4725 7087 344909 2362 49610 850461 203166 4725
## 15 645423 90422 9354 0 280619 3118 71714 545647 109129 0
## jackass motherfucker motherfucking nigger piss pissed pissy pussy shit
## 1 13354 12082 3179 5087 69948 169148 11446 197763 2352169
## 2 4668 4456 2546 2971 71081 152770 4456 103969 1733094
## 3 0 19458 3243 3243 64861 139452 6486 204313 2085293
## 4 0 5852 5852 0 67293 201880 0 152141 2390371
## 5 2115 7403 3173 0 102580 155457 10575 32783 905244
## 6 4354 30480 0 13063 65314 117565 4354 296089 3239557
## 7 0 9670 4835 0 78972 262702 11282 269149 3932477
## 8 1083 7579 7146 3032 57170 159816 4331 161332 2190864
## 9 3279 21641 11148 1967 53774 172469 2623 255097 2863124
## 10 10099 0 0 0 42077 72373 1683 26929 540270
## 11 7467 3200 3200 2133 87469 147204 13867 76802 1930716
## 12 0 0 3780 0 30238 143630 0 105833 2260280
## 13 19466 7786 0 0 105116 147941 0 70077 1666277
## 14 0 9450 0 0 61422 170092 2362 148831 2093078
## 15 3118 3118 3118 0 77950 96658 6236 34298 1172362
## shittiest shitty slut slutty twat whore
## 1 1908 52779 37518 3179 3179 46420
## 2 3395 41163 31615 5305 3819 32676
## 3 9729 22701 32431 0 0 25945
## 4 0 20481 29258 0 0 29258
## 5 2115 43359 35956 4230 2115 45474
## 6 4354 65314 21771 4354 0 13063
## 7 0 35457 19340 6447 4835 24175
## 8 866 25337 19273 3248 1083 20573
## 9 656 8525 16394 1967 1312 28198
## 10 1683 18514 10099 1683 6732 38711
## 11 0 43734 28801 7467 2133 55468
## 12 0 3780 11339 0 0 11339
## 13 0 15573 23359 3893 0 50611
## 14 0 16537 14174 4725 2362 7087
## 15 3118 9354 9354 6236 3118 40534
For each county, Grieve et al. (2017) measured the relative frequency per billion words of each word in all the Tweets originating from that county, by dividing the frequency of that word in those Tweets by the total number of words in those Tweets and multiplying the quotient by 1 billion. These swear words are all in the top 10,000 most frequent word types in the corpus. Here is a summary of the swear words.
summary(norm_swear[, 2:ncol(norm_swear)])
## ass asshole bastard bitch
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 633706 1st Qu.: 42612 1st Qu.: 4940 1st Qu.: 522422
## Median : 861821 Median : 63864 Median : 9371 Median : 727234
## Mean :1017433 Mean : 67841 Mean : 11091 Mean : 790277
## 3rd Qu.:1266972 3rd Qu.: 86219 3rd Qu.: 13983 3rd Qu.: 997230
## Max. :8904228 Max. :567215 Max. :310376 Max. :7340226
## bitched bitchy bloody bullshit
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 2679 1st Qu.: 3599 1st Qu.: 84137
## Median : 3385 Median : 6801 Median : 8789 Median :111607
## Mean : 4892 Mean : 8674 Mean : 11888 Mean :113840
## 3rd Qu.: 6305 3rd Qu.: 11120 3rd Qu.: 14831 3rd Qu.:139169
## Max. :508411 Max. :283607 Max. :591876 Max. :714967
## cock crap crappy cunt
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 5282 1st Qu.: 60791 1st Qu.: 4989 1st Qu.: 7375
## Median : 11404 Median : 88122 Median : 9890 Median : 17099
## Mean : 14377 Mean : 98436 Mean : 11988 Mean : 21012
## 3rd Qu.: 17611 3rd Qu.:124043 3rd Qu.: 14915 3rd Qu.: 28555
## Max. :1242999 Max. :821355 Max. :244499 Max. :435954
## damn damnit damned darn
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 578299 1st Qu.: 6920 1st Qu.: 4672 1st Qu.: 8763
## Median : 742309 Median : 14030 Median : 8998 Median : 13989
## Mean : 794217 Mean : 17003 Mean : 11526 Mean : 17009
## 3rd Qu.: 944342 3rd Qu.: 22559 3rd Qu.: 14355 3rd Qu.: 20566
## Max. :3846951 Max. :368363 Max. :235349 Max. :263481
## dick dickhead douche douchebag
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 106690 1st Qu.: 0 1st Qu.: 10956 1st Qu.: 0
## Median : 152438 Median : 837 Median : 20780 Median : 4606
## Mean : 158934 Mean : 2405 Mean : 28157 Mean : 7044
## 3rd Qu.: 199728 3rd Qu.: 2874 3rd Qu.: 37733 3rd Qu.: 9234
## Max. :1426300 Max. :65772 Max. :357483 Max. :275330
## dumbass dyke fag faggot
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 16771 1st Qu.: 0 1st Qu.: 9901 1st Qu.: 9834
## Median : 25801 Median : 1280 Median : 18972 Median : 20574
## Mean : 28587 Mean : 2748 Mean : 22735 Mean : 25157
## 3rd Qu.: 35497 3rd Qu.: 3747 3rd Qu.: 29998 3rd Qu.: 34166
## Max. :301841 Max. :133627 Max. :301341 Max. :308339
## fatass freaking friggin fuck
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 83361 1st Qu.: 0 1st Qu.: 982852
## Median : 2750 Median :118324 Median : 3148 Median :1392858
## Mean : 3992 Mean :129845 Mean : 5194 Mean :1429525
## 3rd Qu.: 5219 3rd Qu.:161718 3rd Qu.: 6041 3rd Qu.:1835999
## Max. :157093 Max. :900328 Max. :307630 Max. :9527509
## fucked fucker fuckery fucking
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 116560 1st Qu.: 12704 1st Qu.: 0 1st Qu.: 498194
## Median : 173033 Median : 21614 Median : 1763 Median : 724606
## Mean : 177267 Mean : 25471 Mean : 3784 Mean : 771294
## 3rd Qu.: 232068 3rd Qu.: 32636 3rd Qu.: 5178 3rd Qu.: 991071
## Max. :1133503 Max. :261505 Max. :277937 Max. :4075971
## goddamn gosh hell hoe
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2484 1st Qu.: 47709 1st Qu.: 406799 1st Qu.: 64499
## Median : 10121 Median : 72283 Median : 498316 Median : 110292
## Mean : 12654 Mean : 82519 Mean : 531068 Mean : 155838
## 3rd Qu.: 17097 3rd Qu.: 103681 3rd Qu.: 613539 3rd Qu.: 200547
## Max. :231535 Max. :2601908 Max. :2770083 Max. :1949566
## homo jackass motherfucker motherfucking
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 4362 1st Qu.: 0
## Median : 6559 Median : 4370 Median : 10143 Median : 3647
## Mean : 7817 Mean : 5617 Mean : 11714 Mean : 4903
## 3rd Qu.: 10300 3rd Qu.: 7205 3rd Qu.: 15371 3rd Qu.: 6449
## Max. :276932 Max. :154447 Max. :382482 Max. :236967
## nigger piss pissed pissy
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 55306 1st Qu.:125064 1st Qu.: 0
## Median : 3021 Median : 71603 Median :160520 Median : 4475
## Mean : 4693 Mean : 75461 Mean :167956 Mean : 7000
## 3rd Qu.: 6201 3rd Qu.: 91188 3rd Qu.:204546 3rd Qu.: 8865
## Max. :295300 Max. :475602 Max. :747938 Max. :293600
## pussy shit shittiest shitty
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 59076 1st Qu.: 1207801 1st Qu.: 0 1st Qu.: 41936
## Median : 98155 Median : 1608277 Median : 2902 Median : 70336
## Mean : 121016 Mean : 1753821 Mean : 4029 Mean : 76185
## 3rd Qu.: 154377 3rd Qu.: 2174110 3rd Qu.: 5868 3rd Qu.:102097
## Max. :1488628 Max. :12309084 Max. :69425 Max. :550661
## slut slutty twat whore
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 24019 1st Qu.: 0 1st Qu.: 0 1st Qu.: 25511
## Median : 38600 Median : 4738 Median : 3116 Median : 37334
## Mean : 43288 Mean : 5727 Mean : 4634 Mean : 41494
## 3rd Qu.: 55727 3rd Qu.: 7935 3rd Qu.: 6107 3rd Qu.: 52540
## Max. :547945 Max. :150670 Max. :130014 Max. :547945
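The normalisation described above can be sketched in a few lines of base R. The counts here are made up for illustration and are not taken from the actual corpus:

```r
# Hypothetical raw counts for one county (invented numbers, not corpus data)
word_count  <- 152    # occurrences of the swear word in the county's Tweets
total_words <- 1e5    # total number of words in those Tweets

# Relative frequency per billion words: divide by the total,
# then scale the quotient up to 1 billion
rel_freq <- word_count / total_words * 1e9
rel_freq
```

This is exactly the scale the columns of norm_swear are on, which is why the values in the summary above run into the millions.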
Before we map our swearing data, we need to understand the basics of cartography in R.
First, we need to get a map of the US, which we will format and use as a base to plot our swear word relative frequencies onto. There are several stages to setting up a nice map. Aside from the first step, which just involves reading in the underlying map, they're all optional.
First, we need to get a US map. Fortunately, working with US data is very easy in R, since all the necessary maps can be accessed via library(maps). We use ggplot2's map_data function to extract the relevant information from the package.
usa <- map_data("usa")
Now we’ll have a look at the very basic US map. For this we’ll need ggplot:
ggplot() +
geom_polygon(data = usa,
aes(x = long, y = lat, group = region))
If you want to map other countries, you can download and read in the base mapping data (e.g. shapefiles), which are available from various sources. This is especially interesting if you're looking to work with administrative regions and the like. For country outlines, you can also use library(rworldmap). The example below shows how to produce a map of Germany, Austria and Switzerland (= German-Speaking Area, GSA). rworldmap works with coordinates.
This code chunk gets the world map and creates a vector of three country names. We then look up those countries in the map and extract their coordinates, so that we can use them for mapping.
worldMap <- getMap()
GSA <- c("Germany", "Austria", "Switzerland")
GSA_map <- which(worldMap$NAME %in% GSA)
GSA_coord <- lapply(GSA_map, function(i){
df <- data.frame(worldMap@polygons[[i]]@Polygons[[1]]@coords)
df$region = as.character(worldMap$NAME[i])
colnames(df) <- list("long", "lat", "region")
return(df)
})
GSA_coord <- do.call("rbind", GSA_coord)
After this, we can have a look at our three countries using ggplot. The coord_fixed argument makes sure that the relationship between x and y is correct; it fixes the aspect ratio.
gsa_map <- ggplot() +
geom_polygon(data = GSA_coord,
aes(x = long, y = lat, group = region)) +
# this bit does the aspect ratio fix
coord_fixed(1.3)
gsa_map
Now, we need to make sure our US data can be mapped, which means we don’t just need the outline of the US, but we need the counties. We can extract them from our maps package.
counties <- map_data("county")
ggplot() +
geom_polygon(data = counties,
aes(x = long, y = lat, group = group),
# to see the counties we add a colour for outline and filling
color = "black", fill = "lightgrey",
size = .1) +
coord_fixed(1.3)
Now that we have a basic map of the US, we can make it look a bit nicer, so that subsequent maps are easier to read.
ggplot() +
geom_polygon(data = counties,
aes(x = long, y = lat, group = group),
color = "black", fill = "white",
size = .1) +
coord_fixed(1.3) +
theme_minimal() + # sets the theme for the plot
ggtitle("US Map with Counties") + # gives the plot a title
theme(axis.title.x = element_blank(), # removes x axis title, here longitude
axis.title.y = element_blank(), # removes y axis title, here latitude
axis.text.x = element_blank(), # removes x axis text, here coordinates
axis.text.y = element_blank(), # removes y axis text, here coordinates
panel.grid.major = element_blank(), # removes grid lines
panel.grid.minor = element_blank(), # removes grid lines
plot.title = element_text(hjust = 0.5)) # centres title
Now that we have a base map and our data read in, we need to make sure the data can be mapped. This might look a bit complicated, but what we're doing is getting the coordinate data that we need to join with our existing dataset.
First, we get a map of the counties (i.e. the geo-information we need) and save it as us_geo (and have a little look, colourful!). For this we need the package 'sf'. We're still using the same 'maps' library as before, but since each county has multiple sets of coordinates, we need a format that can be matched to our dataset, where each location is just one row; hence we handle it with 'sf'. We then merge the two separate data frames into one using dplyr.
us_geo <- st_as_sf(maps::map(database = "county",
plot = FALSE,
fill = TRUE))
plot(us_geo)
us_geo_swear <- us_geo %>%
left_join(norm_swear,
by = c("ID" = "county"))
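To see what the left join does, here is a toy illustration with made-up county IDs and values; base R's merge with all.x = TRUE behaves like dplyr's left_join:

```r
# Toy data frames mimicking the structure of us_geo and norm_swear
geo  <- data.frame(ID = c("alabama,autauga", "alabama,baldwin"),
                   geom = c("poly1", "poly2"))   # placeholder geometries
freq <- data.frame(county = c("alabama,autauga", "alabama,baldwin"),
                   ass = c(1520421, 1246775))

# Left join: keep every row of geo and attach the matching frequency columns
joined <- merge(geo, freq, by.x = "ID", by.y = "county", all.x = TRUE)
joined
```

Counties present in the geo data but missing from the frequency table would simply get NA in the frequency columns, which is why we drop NAs before plotting later on.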
If you have a look at the new data frame us_geo_swear, you can see that it is essentially the same data as before, but the last column now holds the geometry: a list of coordinate points (multipolygons) for every county, which we need for plotting.
# shows us that it is a data frame
class(us_geo_swear)
## [1] "sf" "data.frame"
# you can see that we now have a data frame that contains multipolygons
head(us_geo_swear)
## Simple feature collection with 6 features and 53 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686
## Geodetic CRS: WGS 84
## ID ass asshole bastard bitch bitched bitchy bloody
## 1 alabama,autauga 1520421 49600 9538 962106 6995 8903 5087
## 2 alabama,baldwin 1246775 54318 6578 807348 2334 7851 14004
## 3 alabama,barbour 2263661 29188 3243 959948 3243 6486 3243
## 4 alabama,bibb 1451192 14629 2926 1009398 0 8777 0
## 5 alabama,blount 559433 72969 4230 506556 2115 5288 3173
## 6 alabama,bullock 2168413 56605 0 1184354 0 8708 0
## bullshit cock crap crappy cunt damn damnit damned darn dick
## 1 120184 15897 146255 13354 22892 1206925 19077 8903 13990 210481
## 2 98452 7002 109910 10397 9124 907073 9760 10185 10185 113729
## 3 74591 3243 113507 19458 3243 1258310 0 25945 22701 136209
## 4 105328 0 90699 8777 2926 1176168 17555 2926 17555 93625
## 5 101523 9518 201988 6345 9518 469543 13748 23266 8460 59222
## 6 182878 8708 21771 4354 0 1240960 0 0 4354 300443
## dickhead douche douchebag dumbass dyke fag faggot fatass freaking friggin
## 1 3179 14626 6359 43241 2544 42605 40697 6359 167876 2544
## 2 2971 18884 5729 29069 2546 19521 15489 4031 170593 2546
## 3 0 3243 3243 22701 0 9729 0 0 175126 3243
## 4 2926 5852 0 11703 2926 2926 8777 0 187251 8777
## 5 9518 13748 4230 25381 0 16920 32783 4230 195643 5288
## 6 0 4354 0 52251 17417 13063 52251 0 47897 0
## fuck fucked fucker fuckery fucking goddamn gosh hell hoe homo
## 1 1441570 212388 21620 6359 592017 5087 69948 695667 268347 17169
## 2 1137714 139191 15065 2546 462767 5941 81690 573101 252920 7426
## 3 1115615 158910 12972 3243 376196 9729 64861 901573 369710 3243
## 4 1351715 236989 20481 17555 833850 8777 38035 506162 298431 2926
## 5 775168 88832 12690 3173 319374 4230 132191 379653 131134 1058
## 6 1941993 278672 21771 8708 574760 13063 30480 735867 335277 0
## jackass motherfucker motherfucking nigger piss pissed pissy pussy shit
## 1 13354 12082 3179 5087 69948 169148 11446 197763 2352169
## 2 4668 4456 2546 2971 71081 152770 4456 103969 1733094
## 3 0 19458 3243 3243 64861 139452 6486 204313 2085293
## 4 0 5852 5852 0 67293 201880 0 152141 2390371
## 5 2115 7403 3173 0 102580 155457 10575 32783 905244
## 6 4354 30480 0 13063 65314 117565 4354 296089 3239557
## shittiest shitty slut slutty twat whore geom
## 1 1908 52779 37518 3179 3179 46420 MULTIPOLYGON (((-86.50517 3...
## 2 3395 41163 31615 5305 3819 32676 MULTIPOLYGON (((-87.93757 3...
## 3 9729 22701 32431 0 0 25945 MULTIPOLYGON (((-85.42801 3...
## 4 0 20481 29258 0 0 29258 MULTIPOLYGON (((-87.02083 3...
## 5 2115 43359 35956 4230 2115 45474 MULTIPOLYGON (((-86.9578 33...
## 6 4354 65314 21771 4354 0 13063 MULTIPOLYGON (((-85.66866 3...
# If you open the data frame and scroll to the last column,
# you can see the list in the list.
view(us_geo_swear)
Now that the data is prepared, we can try to map some swear words. Note that we've switched to geom_sf, because it can handle the sf data we've added for the geolocation of our swear words. That also means we don't need geom_polygon, but as the name suggests, it has similar functionality.
This first map is a very basic choropleth map based on our variable “ass”:
ggplot() +
geom_sf(data = us_geo_swear,
aes(fill = ass))
Let’s add our design to it:
ggplot() +
geom_sf(data = us_geo_swear,
aes(fill = ass)) +
theme_minimal() +
ggtitle("'Ass' Distribution in the US per County") +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
That looks sort of like what we want, so let's rework it a bit. Note that we divide the relative frequencies of 'ass' by 10,000; since we're dealing with large numbers, this makes the legend easier to read.
ggplot() +
geom_sf(data = us_geo_swear,
aes(fill = ass / 10000),
lwd = 0.1, # lwd sets the outline thickness of the polygons
color = "grey") + # this sets the outline colour
theme_minimal() +
ggtitle("'Ass' Distribution in the US per County") +
# this adds a new legend title with line break \n
guides(fill = guide_legend(title = "Distribution \nin 10,000")) +
# here we start using some nicer colours
scale_fill_continuous(low = "white",
high = "mediumpurple4") +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.title = element_text(size = 8))
We can see that there seems to be a trend towards 'ass' in the Southeast. Let's see if we can spot some more trends.
ggplot() +
geom_sf(data = us_geo_swear,
aes(fill = dickhead / 10000),
lwd = 0.1,
color = "grey") +
theme_minimal() +
ggtitle("'Dickhead' Distribution in the US per County") +
guides(fill = guide_legend(title = "Distribution \nin 10,000")) +
scale_fill_continuous(low = "white",
high = "mediumpurple4") +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.title = element_text(size = 8))
How about 'fuck', but in green?
ggplot() +
geom_sf(data = us_geo_swear,
aes(fill = fuck / 10000),
lwd = 0.1,
color = "grey") +
theme_minimal() +
ggtitle("'Fuck' Distribution in the US per County") +
guides(fill = guide_legend(title = "Distribution \nin 10,000")) +
scale_fill_continuous(low = "white",
high = "aquamarine4") + # green this time?
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.title= element_text(size = 8))
In the next step for the swearing maps, we'll implement quantiles. That means we split the relative frequency distribution of the word we want to map into intervals. We're using "quantile"-style intervals here, where the values are split so that each interval contains a roughly equal number of values, although the range of each interval will likely vary (often considerably).
In order to do this, we'll first pick a swear word and its geometry column and create a new data frame. Then we'll calculate the quantiles for our swear word and add them as a factor. Exchange the swear word in this code to run it with a different one.
# select the columns you need
quant_swear <- us_geo_swear %>%
select(bitch, geom)
# calculate quantiles
q <- quantile(quant_swear$bitch,
na.rm = TRUE)
# add factor given the quantiles to our list
quant_swear$quant <- factor(findInterval(quant_swear$bitch, q))
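To see what findInterval does with the quantile breaks, here is a small worked example on made-up frequencies:

```r
# Ten made-up frequency values (not corpus data)
x <- c(2, 5, 7, 10, 12, 15, 20, 25, 30, 100)

# quantile() with defaults returns the 0%, 25%, 50%, 75% and 100% points
q <- quantile(x, na.rm = TRUE)

# findInterval() assigns each value the index of the interval it falls into:
# 1 = at or above the minimum but below the 25% point, ...,
# 5 = equal to the maximum
bins <- findInterval(x, q)
bins
```

Note that only the maximum itself lands in bin 5, so the top bin is typically much smaller than the others; the factor() call in the chunk above turns these integer bins into discrete levels we can map to colours.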
Now we can map our data. Instead of filling the polygons by the frequency of our swear word, we use the quantiles we've just defined. Note that this means we're moving from a continuous colour scale to a discrete one, so we need to change the colouring option of our map. That's why we first define these colours.
cols <- c("1" = "white",
"2" = "lightsteelblue1",
"3" = "lightsteelblue2",
"4" = "lightsteelblue3",
"5" = "lightsteelblue4")
ggplot() +
# we've added na.omit to not have NAs plotted
geom_sf(data = na.omit(quant_swear),
aes(fill = quant),
lwd = 0.1,
color = "grey") +
# here we pass our colour list
scale_colour_manual(values = cols,
# and say we use it to fill
aesthetics = c("colour", "fill")) +
theme_minimal() +
ggtitle("'Bitch' Quantile Distribution in the US") +
guides(fill = guide_legend(title = "Quantiles")) +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.title = element_text(size = 8))
Let’s map the quantiles of another swear word and change the colours for the map. If you want to play around with colour yourself, this website offers a good overview.
quant_swear <- us_geo_swear %>% select(shit, geom)
q <- quantile(quant_swear$shit, na.rm = TRUE)
quant_swear$quant <- factor(findInterval(quant_swear$shit, q))
cols <- c("1" = "white",
"2" = "rosybrown1",
"3" = "rosybrown2",
"4" = "rosybrown3",
"5" = "rosybrown4")
ggplot() +
geom_sf(data = na.omit(quant_swear),
aes(fill = quant),
lwd = 0.1,
color = "grey") +
scale_colour_manual(values = cols,
aesthetics = c("colour", "fill")) +
theme_minimal() +
ggtitle("'Shit' Quantile Distribution in the US") +
guides(fill = guide_legend(title = "Quantiles")) +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.title = element_text(size = 8))
As the last bit, we'll try adding another layer to our ggplot maps. Remember our map of the German-speaking area?
gsa_map
If we want to add cities to this, because we're interested in city-level data, we can do so using geom_point. Let's first create some sample data.
gsa_data <- data.frame(
City_name = c("Cologne", "Munich", "Vienna", "Bern", "Berlin", "Hamburg", "Kassel", "Graz"),
Count1 = c(19, 4, 2, 5, 10, 43, 18, 7),
Count2 = c(20, 5, 1, 3, 21, 57, 28, 4),
Proportion = c(38.78, 44.44, 66.67, 62.5, 32.26, 43.0, 39.13, 63.64),
Long = c(6.9578, 11.5755, 16.3731, 7.4474, 13.3833, 10, 9.4912, 15.4409),
Lat = c(50.9422, 48.1372, 48.2083, 46.948, 52.5167, 53.55, 51.3166, 47.0749))
gsa_data
## City_name Count1 Count2 Proportion Long Lat
## 1 Cologne 19 20 38.78 6.9578 50.9422
## 2 Munich 4 5 44.44 11.5755 48.1372
## 3 Vienna 2 1 66.67 16.3731 48.2083
## 4 Bern 5 3 62.50 7.4474 46.9480
## 5 Berlin 10 21 32.26 13.3833 52.5167
## 6 Hamburg 43 57 43.00 10.0000 53.5500
## 7 Kassel 18 28 39.13 9.4912 51.3166
## 8 Graz 7 4 63.64 15.4409 47.0749
Note that we again have a dataset which contains both the linguistic information (here the counts and proportion) and the geolocation information. With this, we can map the data using the cities.
First, we again use our coordinates to create the basic map of the GSA, just as we did before. Only in the geom_point layer do we add the city data.
ggplot() +
geom_polygon(data = GSA_coord,
aes(x = long, y = lat, group = region),
# colour sets the outline and fill sets the filling of the GSA
colour = "black",
size = 0.1,
fill = "snow3") +
coord_map(xlim = c(4.5, 17), # this cuts the map to the coordinates we need
ylim = c(45.5, 55)) +
theme_minimal() +
geom_point(data = gsa_data, # here we add the cities to our map
aes(x = Long, y = Lat, col = Proportion, size = (Count1+Count2)),
alpha = 0.9) +
guides(size = "none") + # suppress the size legend
scale_color_gradient(low = "seagreen3", high = "mediumpurple3") +
ggtitle("Feature 1 vs Feature 2 in the GSA") +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
plot.title = element_text(hjust = 0.5))
This map shows the proportion of usage of the two features in the given cities: in our made-up data, Germany shows more feature 1 use, whereas Austria and Switzerland tend towards feature 2. The proportion baseline is feature 1. The size of each city's point depends on the combined occurrences of both features.
As the last step, we save our map; ggsave writes the most recently displayed plot to the file.
ggsave("germany_map.png", width = 6.5, height = 5.5)